This document analyzes how different color scales (sequential and
diverging) affect the perception of correlation strength in heatmaps.
Using the diamonds dataset from the ggplot2 package, we will:
- Apply a diverging color scale (RdBu) to emphasize both the direction (positive/negative) and magnitude of correlations.
- Apply a sequential color scale (Viridis) to emphasize the strength of correlations, regardless of their direction.

To begin, we load the necessary libraries and prepare the dataset for
analysis. The diamonds dataset from the
ggplot2 package contains a mix of numerical and categorical
columns. Since correlation requires numerical data, we filter the
dataset to include only the numerical columns. This ensures that we can
compute meaningful pairwise correlations.
# Load necessary libraries
library(ggplot2) # For the diamonds dataset
library(plotly) # For creating interactive heatmaps
library(dplyr) # For data manipulation
# Load the diamonds dataset
data(diamonds)
# Select only numerical columns
diamonds_numeric <- diamonds[, sapply(diamonds, is.numeric)]
# Display the first few rows of the numerical data
head(diamonds_numeric) # Preview the dataset
## # A tibble: 6 × 7
## carat depth table price x y z
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 61.5 55 326 3.95 3.98 2.43
## 2 0.21 59.8 61 326 3.89 3.84 2.31
## 3 0.23 56.9 65 327 4.05 4.07 2.31
## 4 0.29 62.4 58 334 4.2 4.23 2.63
## 5 0.31 63.3 58 335 4.34 4.35 2.75
## 6 0.24 62.8 57 336 3.94 3.96 2.48
In this code block, we loaded three libraries: ggplot2
for accessing the diamonds dataset, plotly for generating
interactive heatmaps, and dplyr for manipulating data
easily. The diamonds dataset is filtered to include only numeric columns
using sapply(diamonds, is.numeric), allowing us to
calculate correlations in the next step.
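The column filter can be seen in isolation on a small data frame. The sketch below is illustrative only: `toy` and its `cut` labels are made-up stand-ins, not rows from diamonds, and base R's `vapply` is used in place of `sapply` because it always returns a logical vector.

```r
# Toy data frame (hypothetical values) mixing numeric and categorical columns
toy <- data.frame(
  carat = c(0.23, 0.21, 0.29),
  cut   = factor(c("Ideal", "Premium", "Good")),
  price = c(326L, 326L, 334L)
)

# is.numeric() is TRUE for double and integer columns, FALSE for factors
is_num <- vapply(toy, is.numeric, logical(1))
print(is_num)

# Keep only the numeric columns; drop = FALSE keeps the result a data frame
# even when a single column is selected
toy_numeric <- toy[, is_num, drop = FALSE]
print(names(toy_numeric))
```

The integer column `price` is kept alongside the double column `carat`, which matches the `<int>` and `<dbl>` column types in the preview above.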
In this step, we calculate the correlation matrix for the numerical columns selected from the diamonds dataset. The correlation matrix provides pairwise correlation coefficients, showing the strength and direction of the relationships between variables.
# Calculate the correlation matrix for numerical columns
correlation_matrix <- cor(diamonds_numeric, use = "complete.obs")
# Print the correlation matrix
print(correlation_matrix) # Display the correlation values
## carat depth table price x y
## carat 1.00000000 0.02822431 0.1816175 0.9215913 0.97509423 0.95172220
## depth 0.02822431 1.00000000 -0.2957785 -0.0106474 -0.02528925 -0.02934067
## table 0.18161755 -0.29577852 1.0000000 0.1271339 0.19534428 0.18376015
## price 0.92159130 -0.01064740 0.1271339 1.0000000 0.88443516 0.86542090
## x 0.97509423 -0.02528925 0.1953443 0.8844352 1.00000000 0.97470148
## y 0.95172220 -0.02934067 0.1837601 0.8654209 0.97470148 1.00000000
## z 0.95338738 0.09492388 0.1509287 0.8612494 0.97077180 0.95200572
## z
## carat 0.95338738
## depth 0.09492388
## table 0.15092869
## price 0.86124944
## x 0.97077180
## y 0.95200572
## z 1.00000000
The cor function computes the correlations between all
pairs of numerical variables in the dataset. The
use = "complete.obs" argument ensures that any rows with
missing values are excluded from the computation. This step results in a
matrix where rows and columns correspond to the variables, and the
values represent the correlation coefficients between them.
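To make the effect of `use = "complete.obs"` concrete, here is a minimal sketch on made-up data with one missing value (the data frame `toy` is hypothetical). It compares `"complete.obs"` with manual row removal and with the alternative `"pairwise.complete.obs"` setting of `cor`.

```r
# Toy data (hypothetical values) with one missing entry in column b
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, NA, 8, 10),
  c = c(5, 3, 4, 1, 2)
)

# "complete.obs": drop every row containing an NA, then correlate
cor_complete <- cor(toy, use = "complete.obs")

# Same result as removing the incomplete rows by hand first
cor_manual <- cor(na.omit(toy))
print(all.equal(cor_complete, cor_manual))

# "pairwise.complete.obs": each pair uses all rows where both columns
# are present, so the a/c correlation still uses row 3
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
print(cor_complete["a", "c"])
print(cor_pairwise["a", "c"])
```

With `"complete.obs"` every coefficient is computed from the same set of rows, which keeps the matrix internally consistent at the cost of discarding some data.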
Next, we visualize the correlation matrix using a diverging color
scale (RdBu). Diverging color scales are particularly
useful for highlighting the direction of correlations. Strong positive
correlations appear in one color (e.g., red), strong negative
correlations appear in another (e.g., blue), and correlations near zero
fall in the pale middle of the scale. This reading holds only when the
color range is centered on zero, for example by spanning -1 to 1.
# Create a heatmap with a diverging color scale (RdBu)
plot_ly(
  x = colnames(correlation_matrix), # X-axis: variable names
  y = rownames(correlation_matrix), # Y-axis: variable names
  z = correlation_matrix, # Z-axis: correlation values
  type = "heatmap", # Heatmap type
  colorscale = "RdBu", # Diverging color scale
  zmin = -1, # Lower limit of the color range
  zmax = 1, # Upper limit; together these center the scale on 0
  text = round(correlation_matrix, 2), # Display rounded correlation values
  texttemplate = "%{text}" # Text formatting
) %>%
  layout(
    title = "Heatmap with Diverging Color Scale (RdBu)",
    xaxis = list(title = "Variables"),
    yaxis = list(title = "Variables")
  )
This block creates an interactive heatmap using the
plot_ly function. The colorscale = "RdBu"
argument applies a diverging color scheme, where red and blue emphasize
positive and negative correlations, respectively. Setting zmin = -1 and
zmax = 1 pins the color range to the full range of a correlation
coefficient, so the pale midpoint of the scale corresponds to a
correlation of zero. Without these limits the range is taken from the
data (roughly -0.30 to 1 here) and the midpoint shifts to a positive
value. Rounded correlation values are displayed on the heatmap for clarity.
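Why the color range matters for a diverging scale can be checked with a little arithmetic. The sketch below assumes that colors are assigned linearly between the lower and upper limits of the color range (zmin and zmax in plotly) and that the limits fall back to the data's minimum and maximum when they are not set. The helper `scale_position` is a hypothetical name introduced for illustration.

```r
# Fraction along the color scale (0 = low end, 1 = high end) for a value z,
# assuming a linear mapping between zmin and zmax
scale_position <- function(z, zmin, zmax) {
  (z - zmin) / (zmax - zmin)
}

# Smallest and largest entries of the correlation matrix printed earlier
z_low  <- -0.29577852  # depth vs. table
z_high <- 1            # diagonal

# Range taken from the data: zero lands well below the midpoint (about 0.228)
pos_data_range <- scale_position(0, zmin = z_low, zmax = z_high)
print(round(pos_data_range, 3))

# Symmetric range: zero sits exactly at the midpoint (0.5)
pos_symmetric <- scale_position(0, zmin = -1, zmax = 1)
print(pos_symmetric)
```

Under these assumptions, a correlation of zero is drawn in the neutral middle color only when the range is symmetric around zero.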
We now create another heatmap, this time using a sequential color
scale (Viridis). Sequential color scales emphasize the
strength of correlations without differentiating between positive and
negative values.
# Create a heatmap with a sequential color scale (Viridis)
plot_ly(
  x = colnames(correlation_matrix), # X-axis: variable names
  y = rownames(correlation_matrix), # Y-axis: variable names
  z = correlation_matrix, # Z-axis: correlation values
  type = "heatmap", # Heatmap type
  colorscale = "Viridis", # Sequential color scale
  text = round(correlation_matrix, 2), # Display rounded correlation values
  texttemplate = "%{text}" # Text formatting
) %>%
  layout(
    title = "Heatmap with Sequential Color Scale (Viridis)",
    xaxis = list(title = "Variables"),
    yaxis = list(title = "Variables")
  )
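This code uses the same approach as before but applies a sequential
color scale (Viridis). In this scale, darker colors
represent lower correlation values, and brighter colors represent higher
values. It is effective for showing how values rank from low to high,
but the scale has no distinct midpoint, so the sign of a correlation is
not marked: the negative correlation between depth and table (-0.30)
simply appears at the dark end of the same gradient as the weak positive
values.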
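If the goal is to show strength alone, one option is to plot absolute values so that -0.30 and 0.30 receive the same color. This is a sketch of that variation, not the approach used in the heatmap above. The small matrix reuses rounded values from the printed correlation matrix, and the plotting step runs only in an interactive session with plotly installed.

```r
# Sub-matrix of the correlations printed earlier (carat, depth, table), rounded
vars <- c("carat", "depth", "table")
corr_small <- matrix(
  c(1.0000,  0.0282,  0.1816,
    0.0282,  1.0000, -0.2958,
    0.1816, -0.2958,  1.0000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# Absolute values discard the sign, so color encodes strength only
corr_strength <- abs(corr_small)
print(corr_strength)

# Optional plot: skipped in batch runs or when plotly is not installed
if (interactive() && requireNamespace("plotly", quietly = TRUE)) {
  print(plotly::plot_ly(
    x = colnames(corr_strength),
    y = rownames(corr_strength),
    z = corr_strength,
    type = "heatmap",
    colorscale = "Viridis",
    zmin = 0,  # no linear relationship
    zmax = 1   # perfect linear relationship
  ))
}
```

The trade-off is that direction is lost entirely, so this view works best next to a diverging heatmap or a table of signed values.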
To complement the visual analysis, we identify the strongest positive and negative correlations in the dataset. This involves reshaping the correlation matrix into a long format and sorting the values to find the top correlations.
# Convert the correlation matrix into a data frame for sorting
correlation_df <- as.data.frame(as.table(correlation_matrix)) %>%
  filter(Var1 != Var2) %>% # Exclude diagonal (self-correlations)
  arrange(desc(Freq)) # Sort by correlation values
# Top 5 positive correlations
top_positive <- correlation_df %>% filter(Freq > 0) %>% head(5)
# Top 5 negative correlations
top_negative <- correlation_df %>% filter(Freq < 0) %>% tail(5)
# Display results
print("Top 5 Positive Correlations:")
## [1] "Top 5 Positive Correlations:"
print(top_positive)
## Var1 Var2 Freq
## 1 x carat 0.9750942
## 2 carat x 0.9750942
## 3 y x 0.9747015
## 4 x y 0.9747015
## 5 z x 0.9707718
print("Top 5 Negative Correlations:")
## [1] "Top 5 Negative Correlations:"
print(top_negative)
## Var1 Var2 Freq
## 4 depth x -0.02528925
## 5 y depth -0.02934067
## 6 depth y -0.02934067
## 7 table depth -0.29577852
## 8 depth table -0.29577852
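In this step, we convert the correlation matrix into a long-format
data frame using as.table. This allows us to filter out
self-correlations (diagonal elements where variables are correlated with
themselves) and sort the correlations in descending order. We then
extract the five largest positive and the five most negative
correlations. Because the matrix is symmetric, each pair appears twice
in these lists (for example, x/carat and carat/x), so the ten rows cover
fewer distinct pairs. The negative correlations are also much weaker
than the positive ones: only depth and table (-0.30) lies more than
marginally below zero.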
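Since a correlation matrix is symmetric, every pair shows up twice in the long-format data frame. One way to list each pair once is to read only the entries above the diagonal. The sketch below uses base R and a small matrix built from rounded values in the printed output; the names `corr_small`, `idx`, and `pairs` are illustrative.

```r
# Sub-matrix of the correlations printed earlier, rounded to four decimals
vars <- c("carat", "depth", "table", "price")
corr_small <- matrix(
  c(1.0000,  0.0282,  0.1816,  0.9216,
    0.0282,  1.0000, -0.2958, -0.0106,
    0.1816, -0.2958,  1.0000,  0.1271,
    0.9216, -0.0106,  0.1271,  1.0000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Positions strictly above the diagonal: each pair once, no self-correlations
idx <- which(upper.tri(corr_small), arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(corr_small)[idx[, "row"]],
  var2 = colnames(corr_small)[idx[, "col"]],
  r    = corr_small[idx],
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Sort from most positive to most negative
pairs <- pairs[order(pairs$r, decreasing = TRUE), ]
print(pairs)
```

Four variables give six distinct pairs, so the sorted table can be read from the top for the strongest positive correlations and from the bottom for the strongest negative ones.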
This analysis highlights the impact of different color scales on the
interpretation of correlation heatmaps. The diverging color scale
(RdBu) effectively distinguishes between positive and
negative correlations, making it ideal for identifying the direction of
relationships. However, it may be visually overwhelming due to its
strong contrasts. In contrast, the sequential color scale
(Viridis) provides a smoother gradient that emphasizes the
strength of correlations but does not differentiate their direction.
The choice of color scale depends on the goals of the analysis. Use diverging scales when the direction of correlations is important and sequential scales when focusing solely on the magnitude of relationships.